Introduction to supervised text classification

Malo Jan

2025-01-06

Overview

  • Last two days : how to get text data and obtain numerical representations of texts
  • Today : how to use these numerical representations to automate the classification of texts into categories, one of the main use cases of text analysis in the social sciences
  • Supervised text classification : what it is, how to do it, how to evaluate it

Supervised text classification : uses and workflow

Text classification as measurement in social sciences

  • Long tradition in social sciences to measure/operationalize concepts from text
  • Content analysis : assign predefined categories to text units to measure a social phenomenon (Krippendorff 2018)
  • Systematic measurement of (latent) concepts in texts, through coding rules and quantification
  • Different from more qualitative/interpretative approaches of text analysis, which focus on meaning, context, interpretation

Text classification through manual content analysis

Advantages and limits of manual content analysis

  • Importance of human judgment, context-knowledge

  • High quality data, in comparison to automated methods such as dictionaries

  • Highly time-consuming, labor-intensive and costly

    • Eg : Manifesto Project, CAP
  • Often need to rely on a small sample of texts, which can be biased

    • Eg. Protest event Analysis

Let the machine come in

  • Text classification can be automated with supervised machine learning
  • One of the main use cases of text analysis in social sciences
  • “Augment” the human, does not replace it
  • Rather than coding all the texts manually, we can train a model to predict the categories based on a sample that we have coded
  • Supervised : we need a labeled dataset to train the model
  • Unsupervised : we do not need a labeled dataset, the model will learn the categories by itself
  • The model will learn the text features that are associated with the categories

Some common machine learning terms explained

From van Atteveldt et al. (2022)

| Machine Learning Lingo      | Statistics Lingo                                      |
|-----------------------------|-------------------------------------------------------|
| Feature                     | Independent variable                                  |
| Label                       | Dependent variable                                    |
| Labeled dataset             | Dataset with both independent and dependent variables |
| To train a model            | To estimate                                           |
| Classifier (classification) | Model to predict nominal outcomes                     |
| To annotate                 | To (manually) code (content analysis)                 |

Inputs and outputs in supervised learning for text analysis

  • Input : text features + labels
  • Model : a function that will learn the relationship between the text features and the labels
  • Output :
    • Probabilities of belonging to each category
    • With two categories (binary classification), the positive class is usually assigned when its probability exceeds 0.5
    • With more than two categories (multi-class classification), the class with the highest probability is assigned
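These probability outputs can be sketched with scikit-learn (assumed here as the toolkit; the one-dimensional numeric data is a toy stand-in for text features):

```python
# Toy sketch of probability outputs for binary and multi-class classifiers
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]])
y_binary = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y_binary)
probs = clf.predict_proba([[0.9]])[0]  # one probability per category, sums to 1
pred = int(probs[1] > 0.5)             # binary rule: positive if P(class 1) > 0.5

# With more than two categories, assign the class with the highest probability
y_multi = np.array([0, 0, 0, 1, 1, 2])
probs_multi = LogisticRegression().fit(X, y_multi).predict_proba([[0.9]])[0]
pred_multi = int(np.argmax(probs_multi))
```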

Use cases

Supervised text classification pipeline

  1. Getting a clean corpus of labelled texts
  2. Transforming texts into numerical features
  3. Model training
  4. Model evaluation
  5. Inference on full corpus
  6. Validation
  7. Use in downstream analysis
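Assuming scikit-learn as the toolkit, steps 1 to 5 can be sketched end to end (the corpus, categories, vectorizer and model below are all illustrative choices, not prescriptions):

```python
# Minimal end-to-end sketch of the supervised classification pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# 1. A (toy) labelled corpus: 0 = fiscal policy, 1 = labour policy
texts = ["tax cuts now", "raise the minimum wage", "lower corporate taxes",
         "strengthen labour unions", "cut income tax", "protect workers rights"]
labels = [0, 1, 0, 1, 0, 1]

# 2.-3. Numerical features + model training, chained in one pipeline
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=0, stratify=labels)
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# 4. Evaluation on the held-out test set
score = f1_score(y_test, model.predict(X_test))

# 5. Inference on the unlabelled remainder of the corpus
predictions = model.predict(["abolish the payroll tax", "defend trade unions"])
```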

Licht et al. (2024) : A supervised learning workflow

Do, Ollion, and Shen (2024) : Policy vs Politics classification task

Getting a clean corpus of labelled texts

Getting a clean corpus of labelled texts

  • Training a “classifier” requires labelled data : texts with categories assigned
  • Labels are used to :
    • Train the model
    • As a gold standard to evaluate the performance of the model
  • Existing labels
    • Potential labels with similar categories from other projects
    • Labels in the wild : metadata from datasets/website
  • Most of the time, we need to label the data ourselves

Which texts to annotate ?

  • From a corpus of texts, we need to select a sample to annotate
  • Series of decisions to make
  • Text unitization : documents, paragraphs, sentences, tokens ?
  • Sample size : how many texts to annotate ?
    • Usually the more the better, to a certain extent
    • But depends on number of categories, complexity of the task
  • Sampling strategy : random, stratified ?
    • Important to have a representative sample
  • Active learning : select algorithmically the most informative texts to annotate
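A stratified draw of texts to annotate might look like the following sketch (pandas assumed; the "year" stratum and the 10% sampling rate are illustrative choices):

```python
# Sketch: stratified sampling of texts to annotate
import pandas as pd

corpus = pd.DataFrame({
    "text": [f"document {i}" for i in range(100)],
    "year": [2000 + i % 4 for i in range(100)],  # illustrative stratum
})

# Sample 10% within each year; a fixed seed keeps the draw replicable
to_annotate = corpus.groupby("year").sample(frac=0.1, random_state=42)
```

Stratifying by year (or source, country, party, etc.) guards against a random draw that over-represents one part of the corpus.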

How to annotate ?

  • Coding rules and a codebook, refined iteratively
  • Quality of annotation is crucial : garbage in, garbage out
  • Has to be systematic and replicable
  • Get to know the data
  • Ideal case (often needed to publish) : several annotators, inter-coder reliability tests (Cohen’s kappa, Krippendorff’s alpha, etc.)
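Cohen's kappa between two annotators can be computed directly with scikit-learn's implementation (the annotations below are illustrative):

```python
# Sketch: inter-coder reliability via Cohen's kappa on toy annotations
from sklearn.metrics import cohen_kappa_score

coder_a = ["econ", "econ", "social", "social", "econ", "social"]
coder_b = ["econ", "social", "social", "social", "econ", "econ"]

# 1 = perfect agreement, 0 = agreement expected by chance alone
kappa = cohen_kappa_score(coder_a, coder_b)
```

Here the coders agree on 4 of 6 items, but chance correction brings kappa down to 1/3; in practice one would revise the codebook and retrain coders until kappa is much higher before annotating at scale.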

Who annotates ?

  • Most text classification projects rely on RAs or crowdworkers
    • Trained on the codebook and paid for their work
  • As PhD students, we often do it ourselves
    • Time consuming, but we know the data better

Can LLMs annotate for you ?

Gilardi, Alizadeh, and Kubli (2023)

Ollion et al. (2023)

Tan et al. (2024)

Alizadeh et al. (2025)

Transforming texts into numerical features

Transforming texts into numerical features

  • Texts are not directly usable by machine learning algorithms
  • Need to convert them into numerical features
  • Different possibilities :
    • Bag-of-words (BOW)
    • Term frequency-inverse document frequency (TF-IDF)
    • Static word embeddings (Word2Vec, GloVe)
    • Contextual word embeddings (eg. BERT)
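The first two representations can be sketched with scikit-learn's vectorizers (assumed toolkit; the three documents are a toy corpus):

```python
# Sketch: bag-of-words counts vs. TF-IDF weights
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the economy is growing", "the economy is shrinking", "vote for us"]

bow = CountVectorizer().fit_transform(docs)    # bag-of-words: raw term counts
tfidf = TfidfVectorizer().fit_transform(docs)  # counts reweighted by rarity

# Both yield a (documents x vocabulary) sparse matrix
```

Embedding-based representations replace these sparse count vectors with dense vectors learned from large corpora, which is what transformer models build on.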

Model training

Supervised learning models

  • Split the annotated sample into a training set and a test set (eg. a 70/30 split)

  • A supervised learning model is a model that learns a mapping function from input features to output labels based on a training set

  • The model “learns” the underlying relationship between the input features and the labels by minimizing a loss function that measures the difference between the predicted labels and the true labels

  • During the training phase, the model adjusts its parameters to minimize the loss function and make better predictions

  • Models differ in the way they learn this mapping function

  • The function is then used to predict the labels of the test set, returning for each text the probability of belonging to each category

Different types of models

  • Classical models : Logistic regression, Naive Bayes, SVM, Random Forest
    • Simple and fast
    • Rely on bag-of-words representations
    • A lot of preprocessing needed
    • Limited performance
  • Transformer models (eg: BERT)
    • Contextual representation of texts
    • No need for extensive preprocessing
    • Transfer-learning paradigm : models already come with language knowledge
    • Computationally intensive

Model evaluation

Evaluating performance of the model

  • Use the test set to evaluate the performance of the model

    • Set of texts that the model has not seen but for which we know the categories
    • “Held-out” data/Gold standard
  • The model is used to predict the categories of the texts in the test set

  • The predictions are compared to the true categories with different metrics

  • Accuracy : proportion of correctly classified texts (highly limited for imbalanced datasets)

\[ \ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \ \]

Confusion matrix

  • Positive class : the class we want to predict
  • Negative class : the other class
|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |
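These four cells can be read off directly from scikit-learn's confusion matrix (toy binary labels, with class 1 as the positive class):

```python
# Sketch: confusion matrix counts on toy predictions
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

# scikit-learn orders rows/columns by sorted label, so for binary
# labels {0, 1} .ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```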

Recall, Precision and f1-score

  • Recall : proportion of actual positive cases that were correctly classified

\[ \ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \ \]

  • Precision : proportion of predicted positive cases that were correctly classified

\[ \ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \ \]

  • f1-score : harmonic mean of precision and recall

\[ \ \text{f1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \ \]
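On a set of toy predictions with TP = 2, FN = 1 and FP = 1, the three metrics can be computed by hand from the formulas above and checked against scikit-learn's implementations:

```python
# Sketch: recall, precision and f1-score by hand vs. scikit-learn
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]  # TP = 2, FN = 1, FP = 1

recall = 2 / (2 + 1)                                # TP / (TP + FN)
precision = 2 / (2 + 1)                             # TP / (TP + FP)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
```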

Recall and precision trade-offs

  • Recall and precision are often in trade-off
    • Increasing recall involves predicting more positive cases, at the risk of more false positives (lower precision)
    • Increasing precision involves predicting fewer positive cases, at the risk of missing some positive cases (lower recall)
  • The model may be optimized to maximize one of the two, depending on the task
    • Eg. in spam detection, we want to maximize precision : we prefer to have some spam in our inbox rather than missing an important email
    • In content moderation/hate speech detection, we often want to maximize recall : we prefer some false positives rather than missing hate speech
  • We can change the decision threshold of the model to favour recall or precision (cf. ROC curves)
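Moving the threshold below the default 0.5 makes the trade-off concrete (toy probabilities standing in for a model's predict_proba output):

```python
# Sketch: recall/precision trade-off as the decision threshold moves
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 1, 1, 0, 0, 0])
p_positive = np.array([0.9, 0.6, 0.4, 0.45, 0.2, 0.1])  # toy P(class 1)

def scores_at(threshold):
    y_pred = (p_positive >= threshold).astype(int)
    return recall_score(y_true, y_pred), precision_score(y_true, y_pred)

recall_default, precision_default = scores_at(0.5)  # default cut-off
recall_low, precision_low = scores_at(0.3)          # lowered cut-off
# Lowering the threshold catches the missed positive (recall up)
# but lets a false positive through (precision down)
```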

What is a good performance ?

  • Which metric to maximize depends on the task
  • In general, we look for at least an f1-score > 0.7
  • A “good” performance depends on the complexity of the task
    • Conceptual complexity
    • Task complexity : multi-class, imbalanced classes
  • Performance also varies across classes
  • But sometimes we do not want our model to be too good either : a suspiciously high performance may signal overfitting

How to improve performance

| Problem                          | Solution                       |
|----------------------------------|--------------------------------|
| Unbalanced classes               | Undersampling & oversampling   |
| Not enough training data         | More annotation                |
| Bad quality of the training data | Better annotation              |
| Bad quality of the text features | Better preprocessing           |
| Limited text representation      | Go for more complex models     |
| Too complex concept              | Accepting okay-ish performance |
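For the unbalanced-classes problem, an alternative to resampling is class reweighting, sketched here with scikit-learn's class_weight option (toy two-dimensional features standing in for text features):

```python
# Sketch: handling class imbalance by reweighting instead of resampling
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (95, 2)),   # 95 majority examples
               rng.normal(3.0, 1.0, (5, 2))])   # 5 minority examples
y = np.array([0] * 95 + [1] * 5)

# "balanced" weights each class inversely to its frequency, so the
# rare positive class is not drowned out during training
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```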

Inference and validation

Inference

  • Once a model is trained and evaluated, it can be used for inference
  • For a supervised learning model, inference means predicting the categories of new (unseen) texts
  • Eg. with a corpus of 1 million texts and a model trained on 1,000 labelled ones, use the model to classify all the remaining texts
  • Important to consider the generalization capacity of the model on new data
    • Out-of-domain texts : texts that are very different from the training texts
    • Out-of-language texts : texts in a language different from the training language
  • If the new texts are somewhat different, it may be important to compare predictions with human annotations

Measurement validity

  • Face validity
  • Convergent validity

Benchmark with other methods

  • Dictionary-based methods

What to do with classification outputs ?

What to do with classification outputs ?

  • It is (mostly) all about measurement of previously unmeasured concepts
  • 4 main uses of classification outputs :
    • Descriptive variation of concept prevalence across time, countries, groups
    • Use of prediction quality as a measure of concepts
    • Use of classification output as dependent variable in a regression
    • Use of classification output as independent variable in a regression

Descriptive variation of concept prevalence

Do, Ollion, and Shen (2024)

Use prediction quality as a measure of the concept


Peterson and Spirling (2018) : accuracy as a measure of polarization

Classification output as DV

Licht et al. (2024) : predicting the use of anti-elite strategies

Classification output as IV

Sattelmayer forthcoming : the effect of party position on immigration on vote switching to the far right

References

Alizadeh, Meysam, Maël Kubli, Zeynab Samei, Shirin Dehghani, Mohammadmasiha Zahedivafa, Juan D Bermeo, Maria Korobeynikova, and Fabrizio Gilardi. 2025. “Open-Source LLMs for Text Annotation: A Practical Guide for Model Setting and Fine-Tuning.” Journal of Computational Social Science 8 (1): 1–25.
Bonikowski, Bart, Yuchen Luo, and Oscar Stuhler. 2022. “Politics as Usual? Measuring Populism, Nationalism, and Authoritarianism in US Presidential Campaigns (1952–2020) with Neural Language Models.” Sociological Methods & Research 51 (4): 1721–87.
Burnham, Michael. 2024. “Stance Detection: A Practical Guide to Classifying Political Beliefs in Text.” Political Science Research and Methods, 1–18.
Burscher, Bjorn, Rens Vliegenthart, and Claes H De Vreese. 2015. “Using Supervised Machine Learning to Code Policy Issues: Can Classifiers Generalize Across Contexts?” The ANNALS of the American Academy of Political and Social Science 659 (1): 122–31.
Burst, Tobias, Simon Franzmann, and Pola Lehmann. 2024. “Manifestoberta. Version 56topics.sentence.2024.1.1.” Berlin / Göttingen: Wissenschaftszentrum Berlin für Sozialforschung / Göttinger Institut für Demokratieforschung. https://doi.org/10.25522/manifesto.manifestoberta.56topics.sentence.2024.1.1.
Do, Salomé, Étienne Ollion, and Rubing Shen. 2024. “The Augmented Social Scientist: Using Sequential Transfer Learning to Annotate Millions of Texts with Human-Level Accuracy.” Sociological Methods & Research 53 (3): 1167–1200.
Gilardi, Fabrizio, Meysam Alizadeh, and Maël Kubli. 2023. “ChatGPT Outperforms Crowd Workers for Text-Annotation Tasks.” Proceedings of the National Academy of Sciences 120 (30): e2305016120.
Krippendorff, Klaus. 2018. Content Analysis: An Introduction to Its Methodology. Sage publications.
Licht, Hauke, Tarik Abou-Chadi, Pablo Barberá, and Whitney Hua. 2024. “Measuring and Understanding Parties’ Anti-Elite Strategies.”
Licht, Hauke, and Ronja Sczepanski. 2024. “Who Are They Talking about? Detecting Mentions of Social Groups in Political Texts with Supervised Learning.” ECONtribute Discussion Paper.
Müller, Stefan, and Sven-Oliver Proksch. 2024. “Nostalgia in European Party Politics: A Text-Based Measurement Approach.” British Journal of Political Science 54 (3): 993–1005.
Ollion, Etienne, Rubing Shen, Ana Macanovic, and Arnault Chatelain. 2023. “ChatGPT for Text Annotation? Mind the Hype.”
Peterson, Andrew, and Arthur Spirling. 2018. “Classification Accuracy as a Substantive Quantity of Interest: Measuring Polarization in Westminster Systems.” Political Analysis 26 (1): 120–28.
Tan, Zhen, Dawei Li, Song Wang, Alimohammad Beigi, Bohan Jiang, Amrita Bhattacharjee, Mansooreh Karami, Jundong Li, Lu Cheng, and Huan Liu. 2024. “Large Language Models for Data Annotation and Synthesis: A Survey.” In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 930–57.